Loading all the required Packages

library(plyr)    # loaded before dplyr so that plyr does not mask dplyr's verbs
library(dplyr)
library(ggplot2)
library(GGally)
library(anytime)
library(caret)
library(ROCR)
library(scales)
library(randomForest)
library(gridExtra)
library(sqldf)
library(plotly)

Pre-Processing Data

Reading Data files and looking at Data summary

unit.data = read.csv('Unit_Level.csv')
lot.data = read.csv('Lot_Level.csv')
summary(unit.data)
##          UNIT_ID                    UNIT_PROCESS_DATE    PARAMETER1   
##  001H5KA42Xk5:     1   01/24/2018 05:29:18 AM:  4552   Min.   :0.210  
##  002rDzPJZV2c:     1   01/24/2018 09:44:42 AM:  4532   1st Qu.:2.290  
##  002SIV9dq2dw:     1   01/24/2018 11:18:32 PM:  3420   Median :2.790  
##  004jlyVK8xcR:     1   01/27/2018 02:13:01 PM:  3417   Mean   :2.855  
##  004JohgXsn8M:     1   01/27/2018 04:35:09 PM:  3411   3rd Qu.:3.390  
##  007Mn160JQuT:     1   01/28/2018 07:48:32 AM:  3402   Max.   :6.920  
##  (Other)     :276957   (Other)               :254229   NA's   :6572   
##    PARAMETER2       PARAMETER3       PARAMETER4         PARAMETER5      
##  Min.   :-10.13   Min.   :-1.680   Min.   :-554.420   Min.   :-573.610  
##  1st Qu.: -6.75   1st Qu.:-0.012   1st Qu.:  -8.470   1st Qu.:  -5.710  
##  Median : -5.80   Median : 0.000   Median :  -1.350   Median :   0.620  
##  Mean   : -5.75   Mean   : 0.000   Mean   :  -0.006   Mean   :   0.009  
##  3rd Qu.: -4.78   3rd Qu.: 0.012   3rd Qu.:   6.940   3rd Qu.:   5.975  
##  Max.   : 26.09   Max.   : 1.716   Max.   : 635.510   Max.   : 361.370  
##  NA's   :6572     NA's   :6572     NA's   :6572       NA's   :6572      
##    PARAMETER6       PARAMETER7       PARAMETER8       PARAMETER9    
##  Min.   : 623.0   Min.   :-112.2   Min.   :-0.091   Min.   :-600.9  
##  1st Qu.: 889.0   1st Qu.: 608.0   1st Qu.:-0.023   1st Qu.:-387.5  
##  Median : 933.0   Median : 696.7   Median :-0.020   Median :-340.2  
##  Mean   : 934.2   Mean   : 693.3   Mean   :-0.021   Mean   :-335.6  
##  3rd Qu.: 976.0   3rd Qu.: 780.4   3rd Qu.:-0.017   3rd Qu.:-288.2  
##  Max.   :3494.0   Max.   :2576.0   Max.   : 0.000   Max.   : 799.2  
##  NA's   :6572     NA's   :6572     NA's   :6572     NA's   :6572    
##   PARAMETER10        PARAMETER11       PARAMETER12      PARAMETER13    
##  Min.   :-1862.06   Min.   :-544.82   Min.   :  0.00   Min.   : 17.90  
##  1st Qu.: -613.40   1st Qu.:-276.11   1st Qu.: 10.34   1st Qu.: 67.16  
##  Median : -550.96   Median :-247.96   Median : 17.25   Median : 87.58  
##  Mean   : -567.03   Mean   :-246.36   Mean   : 19.44   Mean   : 90.66  
##  3rd Qu.: -506.06   3rd Qu.:-218.71   3rd Qu.: 25.41   3rd Qu.:110.87  
##  Max.   :  -35.68   Max.   :  51.14   Max.   :494.32   Max.   :524.33  
##  NA's   :6572       NA's   :6572      NA's   :6572     NA's   :6572    
##  UNIT_CARRIER_POS_X UNIT_CARRIER_POS_Y RESPONSE_FLAG    
##  Min.   :0          Min.   :0.0        Min.   :0.00000  
##  1st Qu.:1          1st Qu.:0.0        1st Qu.:0.00000  
##  Median :2          Median :1.0        Median :0.00000  
##  Mean   :2          Mean   :0.5        Mean   :0.00682  
##  3rd Qu.:3          3rd Qu.:1.0        3rd Qu.:0.00000  
##  Max.   :4          Max.   :1.0        Max.   :1.00000  
##  NA's   :5236       NA's   :5236
summary(lot.data)
##          UNIT_ID                LOT_ID       MATERIAL1_SUPPLIER
##  001H5KA42Xk5:     1   X_04D213M804:  4552       : 98128       
##  002rDzPJZV2c:     1   X_04D284M804:  4532   MFG1: 30293       
##  002SIV9dq2dw:     1   X_04D486M804:  3420   MFG2:148542       
##  004jlyVK8xcR:     1   X_04E339M804:  3417                     
##  004JohgXsn8M:     1   X_04E374M804:  3411                     
##  007Mn160JQuT:     1   X_05C286M805:  3402                     
##  (Other)     :276957   (Other)     :254229                     
##  MATERIAL1_SUPPLIER_LOT_ID MATERIAL2_SUPPLIER MATERIAL2_SUPPLIER_FACILITY
##            :126596              :  5236       .    :  5236               
##  F21ZZ2364A:   790         Tech1:  4365       SE   : 84621               
##  F21ZZ2364S:  9948         Tech2: 89997       SW   :  5376               
##  F21ZZ2364Y:  1102         Tech3:177365       Tech1:  4365               
##  F21ZZ2365U: 13372                            Tech3:177365               
##  MIXED     :125155                                                       
##                                                                          
##  MATERIAL3_SUPPLIER MATERIAL3_SUPPLIER_LOT_ID FAJ_TOOL_ID    
##       : 70754                 :70754                : 70754  
##  Tech1:206209       JZ7745.V45:48373          FAJ011:  1770  
##                     JZ7743.V52:45317          FAJ015:   490  
##                     JZ7767.V21:41985          FAJ211:198175  
##                     JZ7746.V35:18250          FAJ213:  4580  
##                     JZ7743.V51:13742          MIXED :  1194  
##                     (Other)   :38542                         
##    VD_TOOL_ID     VD_TOOL_LANE     NX_TOOL_ID     NX_TOOL_COMPARTMENT
##  VDZ001 :108559   .    :  5236   NXG1911:159543         :  6572      
##  VDZ004 : 63357   BACK :135945   NXG1933: 93017   BOTTOM:127300      
##  VDZ006 : 53337   FRONT:135782   NXG1903: 21636   TOP   :143091      
##         : 39842                  NXG1901:  2249                      
##  MIXED  :  9719                  TPS004 :   406                      
##  VDZ002 :  1119                  TPS005 :   101                      
##  (Other):  1030                  (Other):    11

Checking the number of missing values in Unit Level Data

sapply(unit.data, function(col) sum(is.na(col))) %>% round(digits=2)
##            UNIT_ID  UNIT_PROCESS_DATE         PARAMETER1 
##                  0                  0               6572 
##         PARAMETER2         PARAMETER3         PARAMETER4 
##               6572               6572               6572 
##         PARAMETER5         PARAMETER6         PARAMETER7 
##               6572               6572               6572 
##         PARAMETER8         PARAMETER9        PARAMETER10 
##               6572               6572               6572 
##        PARAMETER11        PARAMETER12        PARAMETER13 
##               6572               6572               6572 
## UNIT_CARRIER_POS_X UNIT_CARRIER_POS_Y      RESPONSE_FLAG 
##               5236               5236                  0

We can observe that there are 6572 missing values in all the parameters, and all 6572 of those rows have RESPONSE_FLAG = 0. We have a class imbalance, with just 0.68% of the data in class 1. Removing these 6572 rows (all from class 0) will not worsen the class imbalance, so dropping them will not affect our analysis.

Removing missing values

unit.data = na.omit(unit.data)
sapply(unit.data, function(col) sum(is.na(col))) %>% round(digits=2)
##            UNIT_ID  UNIT_PROCESS_DATE         PARAMETER1 
##                  0                  0                  0 
##         PARAMETER2         PARAMETER3         PARAMETER4 
##                  0                  0                  0 
##         PARAMETER5         PARAMETER6         PARAMETER7 
##                  0                  0                  0 
##         PARAMETER8         PARAMETER9        PARAMETER10 
##                  0                  0                  0 
##        PARAMETER11        PARAMETER12        PARAMETER13 
##                  0                  0                  0 
## UNIT_CARRIER_POS_X UNIT_CARRIER_POS_Y      RESPONSE_FLAG 
##                  0                  0                  0

Now, there are no missing values in the data.

Checking Normality in data

Let us plot density plots to find out if there is skewness in the data.
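The plotting code for this step is not shown; the following is a minimal sketch of how such density plots could be produced (using column range 3:15 for the parameters, matching the scaling step later, and reshaping with base R's stack()):

```r
library(ggplot2)

# stack() reshapes the 13 numeric parameter columns into long (values, ind)
# form so that facet_wrap can draw one density panel per parameter.
param.long = stack(na.omit(unit.data[3:15]))
ggplot(param.long, aes(x = values)) +
  geom_density(fill = "#56B4E9", alpha = 0.5) +
  facet_wrap(~ind, scales = "free") +
  labs(x = "Parameter value", y = "Density")
```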

All the parameters seem to be approximately normal, except Parameter 10, which is left-skewed, and Parameter 12, which is right-skewed. These two parameters contain negative and zero values; therefore, we will not be able to perform log or root transformations.

Finding outliers in data

Let us draw boxplots to check if there are many outliers in the data.
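The boxplot code is likewise not shown; a sketch along these lines (reusing the same long-format reshape) would give one boxplot per parameter:

```r
library(ggplot2)

# Free y-scales, since before normalization the parameters sit on very
# different ranges; outliers are highlighted in red.
param.long = stack(na.omit(unit.data[3:15]))
ggplot(param.long, aes(x = ind, y = values)) +
  geom_boxplot(outlier.colour = "red", outlier.alpha = 0.3) +
  facet_wrap(~ind, scales = "free") +
  labs(x = NULL, y = "Parameter value")
```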

We can observe that there are a lot of outliers in all these parameters. Since the objective of our case study is failure analysis, outliers might play a very important role in detecting the faults, so we cannot remove them. We could clip them to the 5th and 99th percentiles, but that might affect our predictions too. Therefore, we need to use methods that are robust to outliers to find the parameters most strongly associated with failure.

Feature scaling: Z Normalization

Looking at the summary of the unit data, we observe that the features are on very different scales, so we need to rescale them into a comparable range. Z normalization (standardization) is a standard technique for this: since it only shifts and rescales each feature, it preserves the shape of each distribution, including the outliers we want to keep.

scaled.para = as.data.frame(scale(unit.data[3:15]))
summary(scaled.para)
##    PARAMETER1         PARAMETER2         PARAMETER3       
##  Min.   :-3.53426   Min.   :-3.19849   Min.   :-56.26032  
##  1st Qu.:-0.75450   1st Qu.:-0.73029   1st Qu.: -0.40430  
##  Median :-0.08629   Median :-0.03656   Median :  0.00782  
##  Mean   : 0.00000   Mean   : 0.00000   Mean   :  0.00000  
##  3rd Qu.: 0.71556   3rd Qu.: 0.70828   3rd Qu.:  0.39921  
##  Max.   : 5.43312   Max.   :23.25075   Max.   : 57.46839  
##    PARAMETER4          PARAMETER5          PARAMETER6      
##  Min.   :-39.16179   Min.   :-51.81561   Min.   :-4.80485  
##  1st Qu.: -0.59788   1st Qu.: -0.51658   1st Qu.:-0.69793  
##  Median : -0.09495   Median :  0.05522   Median :-0.01859  
##  Mean   :  0.00000   Mean   :  0.00000   Mean   : 0.00000  
##  3rd Qu.:  0.49062   3rd Qu.:  0.53894   3rd Qu.: 0.64531  
##  Max.   : 44.89051   Max.   : 32.64217   Max.   :39.52211  
##    PARAMETER7         PARAMETER8         PARAMETER9      
##  Min.   :-6.07041   Min.   :-12.7462   Min.   :-3.63140  
##  1st Qu.:-0.64249   1st Qu.: -0.4786   1st Qu.:-0.71046  
##  Median : 0.02553   Median :  0.1520   Median :-0.06267  
##  Mean   : 0.00000   Mean   :  0.0000   Mean   : 0.00000  
##  3rd Qu.: 0.65648   3rd Qu.:  0.6649   3rd Qu.: 0.64946  
##  Max.   :14.18888   Max.   :  3.7858   Max.   :15.53492  
##   PARAMETER10        PARAMETER11        PARAMETER12     
##  Min.   :-16.1898   Min.   :-6.98662   Min.   :-1.6991  
##  1st Qu.: -0.5796   1st Qu.:-0.69639   1st Qu.:-0.7959  
##  Median :  0.2010   Median :-0.03743   Median :-0.1916  
##  Mean   :  0.0000   Mean   : 0.00000   Mean   : 0.0000  
##  3rd Qu.:  0.7623   3rd Qu.: 0.64728   3rd Qu.: 0.5216  
##  Max.   :  6.6427   Max.   : 6.96420   Max.   :41.4977  
##   PARAMETER13      
##  Min.   :-2.27954  
##  1st Qu.:-0.73627  
##  Median :-0.09653  
##  Mean   : 0.00000  
##  3rd Qu.: 0.63313  
##  Max.   :13.58647

Defect percentage for each LOT_ID by UNIT_PROCESS_DATE

Let us merge the Lot level data and unit level data based on UNIT_ID.

Calculating the defect percentage and creating three subsets: defect percentage greater than 0, defect percentage greater than 1, and defect percentage greater than or equal to 0.
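The code for this step is not shown; a plausible sketch, assuming the defect percentage is the share of RESPONSE_FLAG = 1 units per lot and process date (the names defect.rate, defect.pct and the subset names are mine):

```r
library(dplyr)
library(anytime)

# Merge unit- and lot-level data, then compute per-lot/per-date defect rates.
joined.data = merge(unit.data, lot.data, by = "UNIT_ID")
defect.rate = joined.data %>%
  mutate(UNIT_PROCESS_DATE = anydate(UNIT_PROCESS_DATE)) %>%
  group_by(LOT_ID, UNIT_PROCESS_DATE) %>%
  dplyr::summarise(defect.pct = 100 * mean(RESPONSE_FLAG))

defects.ge0 = filter(defect.rate, defect.pct >= 0)  # all lots
defects.gt0 = filter(defect.rate, defect.pct > 0)
defects.gt1 = filter(defect.rate, defect.pct > 1)
```

`dplyr::summarise` is written explicitly to avoid masking by plyr, which is also loaded.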

Plotting Defect percentages over Time(Based on LOT_ID)

We can zoom in and out of the graph to analyze different time frames by selecting an area on the graph. Every label also shows the LOT_ID along with the timestamp at that point, indicating which LOT_ID failed at which point in time.
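An interactive plot of this kind can be built with plotly's ggplotly(), assuming a per-lot summary frame defect.rate with columns LOT_ID, UNIT_PROCESS_DATE and defect.pct (these names are mine):

```r
library(ggplot2)
library(plotly)

# The text aesthetic carries LOT_ID plus timestamp into the hover labels;
# ggplotly adds the zoom/box-select interactivity described above.
p = ggplot(defect.rate,
           aes(x = UNIT_PROCESS_DATE, y = defect.pct,
               text = paste(LOT_ID, UNIT_PROCESS_DATE))) +
  geom_point(colour = "#E69F00") +
  labs(x = "Process date", y = "Defect %")
ggplotly(p, tooltip = "text")
```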

For a more detailed analysis, let us plot the defect percentages greater than 1 over time.

This graph shows that 17th January to 25th January is the most critical time frame, with a very high percentage of defects. We need to analyze this time frame more closely to find the defective lots.

This gives us more insight into which LOT_ID fails on which date. On this basis, we can analyze the tools and lot-level parameters in this time frame to find the causes of the high defect rate.

Top Parameters at Unit Level associated with the increase in defective units

Fitting model

Train-Test split

Let us create a validation set to test our model.

df1 = data.frame(scaled.para, RESPONSE_FLAG = unit.data[18])
set.seed(1)  # fix the RNG so the train/test split is reproducible
test_idx = sample(1:nrow(df1), size = floor(nrow(df1)/4))
test = df1[test_idx,]
train = df1[-test_idx,]
train$RESPONSE_FLAG = as.factor(train$RESPONSE_FLAG)

Fitting GLM Model

We have a binary response variable taking values 0 and 1, so we can fit a binomial logistic regression model.

glm.model = glm(RESPONSE_FLAG~PARAMETER1+PARAMETER2+PARAMETER3+PARAMETER4
                +PARAMETER5+PARAMETER6+PARAMETER7+PARAMETER8+PARAMETER9+
                  PARAMETER10+PARAMETER11+PARAMETER12+PARAMETER13,data = train,
                family=binomial(link='logit'))
summary(glm.model)
## 
## Call:
## glm(formula = RESPONSE_FLAG ~ PARAMETER1 + PARAMETER2 + PARAMETER3 + 
##     PARAMETER4 + PARAMETER5 + PARAMETER6 + PARAMETER7 + PARAMETER8 + 
##     PARAMETER9 + PARAMETER10 + PARAMETER11 + PARAMETER12 + PARAMETER13, 
##     family = binomial(link = "logit"), data = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.9189  -0.0891  -0.0564  -0.0336   4.9430  
## 
## Coefficients:
##             Estimate Std. Error  z value Pr(>|z|)    
## (Intercept) -6.51371    0.05276 -123.454  < 2e-16 ***
## PARAMETER1  -0.25467    0.06496   -3.921 8.83e-05 ***
## PARAMETER2  -0.10555    0.04041   -2.612   0.0090 ** 
## PARAMETER3  -0.02268    0.02903   -0.781   0.4346    
## PARAMETER4   0.20878    0.03407    6.128 8.90e-10 ***
## PARAMETER5   0.02706    0.03328    0.813   0.4162    
## PARAMETER6   0.04705    0.03727    1.262   0.2069    
## PARAMETER7   2.49552    0.04736   52.694  < 2e-16 ***
## PARAMETER8  -0.41794    0.03805  -10.983  < 2e-16 ***
## PARAMETER9   0.20135    0.05144    3.914 9.07e-05 ***
## PARAMETER10 -0.08609    0.03559   -2.419   0.0156 *  
## PARAMETER11 -0.80809    0.04902  -16.484  < 2e-16 ***
## PARAMETER12  0.16785    0.02189    7.669 1.73e-14 ***
## PARAMETER13 -0.81089    0.03681  -22.028  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 16841.5  on 202793  degrees of freedom
## Residual deviance:  9484.1  on 202780  degrees of freedom
## AIC: 9512.1
## 
## Number of Fisher Scoring iterations: 9

Interpreting our results

The GLM fit gives us very important information about the importance of the parameters. We can observe that some parameters have a very high p-value, which signifies that they are not statistically significant predictors of the response variable. PARAMETER3, PARAMETER5 and PARAMETER6 are thus the least important parameters, with very high p-values. Parameters 7, 8, 11 and 13 seem to be important; we will do further analysis to confirm this. Furthermore, we can use the coefficient values to interpret the probability of failure. A response of 0 indicates no failure and 1 indicates a failure, so a positive coefficient (slope) for a predictor means that larger values of that predictor increase the probability of failure. For example, Parameter 7 has a coefficient of about 2.5, so a one-unit increase (on the z-scale) in Parameter 7 raises the log-odds of failure by about 2.5; conversely, Parameter 13 has a negative coefficient, so larger values of Parameter 13 make failure less likely.

Analysis of variance

We can analyze the model using the table of deviance, which tells us how much deviance is explained as each predictor is added.

anova(glm.model,test = 'Chisq')
## Analysis of Deviance Table
## 
## Model: binomial, link: logit
## 
## Response: RESPONSE_FLAG
## 
## Terms added sequentially (first to last)
## 
## 
##             Df Deviance Resid. Df Resid. Dev  Pr(>Chi)    
## NULL                       202793    16841.5              
## PARAMETER1   1      0.9    202792    16840.6 0.3490849    
## PARAMETER2   1     82.8    202791    16757.8 < 2.2e-16 ***
## PARAMETER3   1      0.0    202790    16757.7 0.8660808    
## PARAMETER4   1      0.5    202789    16757.2 0.4682398    
## PARAMETER5   1      4.7    202788    16752.5 0.0304563 *  
## PARAMETER6   1     17.6    202787    16734.9 2.689e-05 ***
## PARAMETER7   1   6189.3    202786    10545.6 < 2.2e-16 ***
## PARAMETER8   1    204.3    202785    10341.3 < 2.2e-16 ***
## PARAMETER9   1      1.7    202784    10339.7 0.1977043    
## PARAMETER10  1     14.0    202783    10325.7 0.0001818 ***
## PARAMETER11  1    227.8    202782    10097.8 < 2.2e-16 ***
## PARAMETER12  1     76.0    202781    10021.8 < 2.2e-16 ***
## PARAMETER13  1    537.8    202780     9484.1 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Interpreting our results

The difference between the null deviance and the residual deviance shows how our model is doing against the null model. Analyzing the residual deviance column, we can see a significant drop in deviance for Parameter 7, which signifies that it is the most important parameter. There are also large decreases in deviance for Parameters 8, 11 and 13. Therefore, these parameters seem to be the most important for detecting failures.

Correlation Matrix

We need to check whether the parameters are highly correlated with each other. This will help us exclude parameters that are highly correlated and would convey no extra information in our model.

ggpairs(unit.data,c('PARAMETER7','PARAMETER11','PARAMETER13','PARAMETER8'))

This tells us that Parameter 7 and Parameter 11 are highly correlated (corr = 0.715). Therefore, we will drop Parameter 11 from the best parameters. The best parameters after this analysis are thus Parameters 7, 8 and 13.

Predicting on Validation set taking all the parameters into account

Let us check whether there is a class balance in our data.

summary(factor(unit.data$RESPONSE_FLAG))
##      0      1 
## 268502   1889

We can observe a high class imbalance in the data. Therefore, we cannot use accuracy as a performance measure. The ROC curve and the area under the curve will be better measures for this study.

Performance Measurement using ROC curve

my.predictions = predict(glm.model,test,type = 'response')
pr <- prediction(my.predictions, test$RESPONSE_FLAG)
prf <- performance(pr, measure = "tpr", x.measure = "fpr")
plot(prf)

The ROC curve indicates that our model is doing well: the true positive rate rises steeply. A curve that stays above the diagonal line indicates a model better than random guessing.

Calculating Area under the curve

The area under the curve is a good summary of performance. An AUC of 1 indicates that our model predicts perfectly, with no false positives or false negatives; AUC varies between 0 and 1, with 0.5 corresponding to random guessing.

auc <- performance(pr, measure = "auc")
auc <- auc@y.values[[1]]
auc
## [1] 0.8507283

Prediction based on top 3 parameters

We have determined that the top 3 parameters are PARAMETER7, PARAMETER8 and PARAMETER13. Let us fit a GLM on just these and predict on the same validation set.

glm.model = glm(RESPONSE_FLAG~PARAMETER7+PARAMETER8+PARAMETER13,data = train,
                family=binomial(link='logit'))
anova(glm.model,test = 'Chisq')
## Analysis of Deviance Table
## 
## Model: binomial, link: logit
## 
## Response: RESPONSE_FLAG
## 
## Terms added sequentially (first to last)
## 
## 
##             Df Deviance Resid. Df Resid. Dev  Pr(>Chi)    
## NULL                       202793      16842              
## PARAMETER7   1   6074.3    202792      10767 < 2.2e-16 ***
## PARAMETER8   1    265.9    202791      10501 < 2.2e-16 ***
## PARAMETER13  1    437.7    202790      10064 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
my.predictions = predict(glm.model,test,type = 'response')
pr <- prediction(my.predictions, test$RESPONSE_FLAG)
prf <- performance(pr, measure = "tpr", x.measure = "fpr")
plot(prf)

auc <- performance(pr, measure = "auc")
auc <- auc@y.values[[1]]
auc
## [1] 0.8521408

Importance of Feature Selection

We can observe a slight increase in AUC when we select just these three features. This confirms that the three parameters (Parameters 7, 8 and 13) predict failure at least as well as the full set.

Let us look at the scatter plots of different Parameters.

Parameter7 vs Parameter13

unscaled_joined.data = merge(unit.data,lot.data,by = "UNIT_ID")
unscaled_joined.data$UNIT_PROCESS_DATE = anydate(unscaled_joined.data$UNIT_PROCESS_DATE)
ggplot(unscaled_joined.data,aes(x = PARAMETER7, y = PARAMETER13, color =factor(RESPONSE_FLAG)))+
  geom_point(alpha = 0.5)+xlab("PARAMETER7")+ ylab("PARAMETER13")+labs(color = "Did it fail?")+ scale_color_manual(values =c("#E69F00","#56B4E9"), labels =c("No", "Yes"))

We can observe that Parameter 13 values mostly vary between 0 and 400. Similarly, Parameter 7 has values varying between 0 and about 2500. It would be difficult to make predictions beyond these ranges. Also, higher values of Parameter 7 (above 1000) are associated with failures. Therefore, we can interpret that most failures occur when Parameter 13 is low and the corresponding Parameter 7 is high.

Parameter8 vs Parameter13

ggplot(unscaled_joined.data,aes(x = PARAMETER13, y = PARAMETER8, color =factor(RESPONSE_FLAG)))+
  geom_point(alpha = 0.5)+xlab("PARAMETER13")+ ylab("PARAMETER8")+labs(color = "Did it fail?")+ scale_color_manual(values =c("#E69F00","#56B4E9"), labels =c("No", "Yes"))

Also, if we plot Parameter 8 against Parameter 13, most of the data indicates no failure; we cannot draw any predictions from this plot alone.

Parameter7 vs Parameter8

ggplot(unscaled_joined.data,aes(x = PARAMETER8, y = PARAMETER7, color =factor(RESPONSE_FLAG)))+
  geom_point(alpha = 0.5)+xlab("PARAMETER8")+ ylab("PARAMETER7")+labs(color = "Did it fail?")+ scale_color_manual(values =c("#E69F00","#56B4E9"), labels =c("No", "Yes"))

This plot signifies that higher values of Parameter 7 have more failures. There is also a narrow range of Parameter 8 values (around -0.025) in which failures occur.

Looking at the three scatter plots, we can say that Parameter 7 alone goes a long way toward predicting the failures. Therefore, to analyze further parameters, we will examine them with respect to increases in Parameter 7 values.

Other Parameters at Lot Level associated with the increase in defective units.

Since there is a huge increase in defects from 19th to 25th January, let us analyze the data by dividing it into two subsets.
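A sketch of the split, taking the 19th-25th January window discussed above (the object names and exact cut dates are my assumptions):

```r
# Critical window vs. baseline, using the merged, date-converted data from above.
window.start = as.Date("2018-01-19")
window.end   = as.Date("2018-01-25")
critical = subset(unscaled_joined.data,
                  UNIT_PROCESS_DATE >= window.start & UNIT_PROCESS_DATE <= window.end)
baseline = subset(unscaled_joined.data, UNIT_PROCESS_DATE < window.start)
```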

FAJ_TOOL_ID Analysis
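The boxplot discussed below could be produced roughly as follows (grouping Parameter 7 by tool ID and failure flag is an assumption about the original plot):

```r
library(ggplot2)

# Parameter 7 per FAJ tool, split by failure flag.
ggplot(unscaled_joined.data,
       aes(x = FAJ_TOOL_ID, y = PARAMETER7, fill = factor(RESPONSE_FLAG))) +
  geom_boxplot() +
  labs(fill = "Did it fail?") +
  scale_fill_manual(values = c("#E69F00", "#56B4E9"), labels = c("No", "Yes"))
```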

This boxplot indicates that there is some issue with assembly line FAJ211. There are some failures before mid-January, but after mid-January the failures on this assembly line increase drastically. Parameter 7 values are also large for this line. This tells us that we need to analyze this assembly line to find where the issues occurred.

Now, let us check MATERIAL1_SUPPLIER with respect to this assembly line.

We can observe a high number of outliers in FAJ211 for MFG1 and MFG2 in mid-January. There are also some blank values in the data, i.e., missing supplier information. If we had data about those material suppliers, we could draw more concrete conclusions about this parameter.

Now,let us check for MATERIAL2_SUPPLIER.

In mid-January, there seem to be a lot of failures for the Tech2 material supplier. We can observe a large number of outliers, with extremely large Parameter 7 values indicating failures. This appears to be one of the main causes of the increased failures.

Let us analyze MATERIAL2_SUPPLIER with respect to RESPONSE_FLAG.

This confirms that Tech2 supplier products had some problem during this time period, which caused an increase in failure rates.

Let us check MATERIAL2_SUPPLIER_FACILITY.

This shows that the FAJ211 TOOL_ID had issues with material from the SE supplier facility in mid-January.

Let us analyze MATERIAL3_SUPPLIER

This shows failures in mid-January for the FAJ211 TOOL_ID with MIXED and Tech1 supplier material. There could be some issue with this supplier during this period. The extremely high values for TOOL_ID FAJ211 indicate that this tool failed.

VD_TOOL_ID Analysis

The boxplot shows many failures occurring across different tools at this assembly station. Since most of the tools show a large increase in failure rates, the problem is more likely with the assembly station than with individual tools.

Let us check whether there is a problem on the front or back lane of the assembly station.
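A sketch of the lane comparison (the grouping is my assumption about the original plot):

```r
library(ggplot2)

# Parameter 7 by lane, split by failure flag; blank/MIXED lanes are excluded.
ggplot(subset(unscaled_joined.data, VD_TOOL_LANE %in% c("FRONT", "BACK")),
       aes(x = VD_TOOL_LANE, y = PARAMETER7, fill = factor(RESPONSE_FLAG))) +
  geom_boxplot() +
  labs(fill = "Did it fail?") +
  scale_fill_manual(values = c("#E69F00", "#56B4E9"), labels = c("No", "Yes"))
```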

This shows that VD_TOOL_LANE is not related to the response variable: both boxplots show comparable Parameter 7 values with almost the same number of outliers.

Let us check MATERIAL1_SUPPLIER

We cannot draw any conclusion about the Material 1 supplier; there is a lot of missing data with extreme outliers.

Let us analyze MATERIAL2_SUPPLIER with respect to RESPONSE_FLAG.

Many TOOL_IDs at this assembly station seem to have had a problem with the Tech2 material supplier during mid-January.

Let us check MATERIAL2_SUPPLIER_FACILITY.

This shows some issue with the SE facility of the Material 2 supplier, which also caused problems at the FAJ_TOOL_ID assembly station.

Let us analyze MATERIAL3_SUPPLIER

This also shows a lot of failures for the Tech1 Material 3 supplier. This appears to be one of the main root causes of the increase in defects.

NX_TOOL_ID Analysis

Let us check NX_TOOL_ID assembly station.

This shows that the NXG1911 and NXG1933 tool IDs have failures. Some failures are also observed on the TPS004 TOOL_ID in mid-January.

Let us check whether these failures are concentrated in a particular compartment of the assembly station.
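A sketch of the compartment comparison for the two suspect tools (restricting to those two tool IDs is my choice):

```r
library(ggplot2)

# Parameter 7 per tool, split by compartment, for the two tools of interest.
ggplot(subset(unscaled_joined.data, NX_TOOL_ID %in% c("NXG1911", "NXG1933")),
       aes(x = NX_TOOL_ID, y = PARAMETER7, fill = NX_TOOL_COMPARTMENT)) +
  geom_boxplot() +
  labs(fill = "Compartment")
```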

Looking at the NXG1911 and NXG1933 tool IDs, we can observe some discrepancy in the bottom compartment of NXG1911; this is one of the root causes of the increase in defects in mid-January. NXG1933 also shows some outliers in its bottom compartment, but we cannot make a strong claim about that tool ID.

Let us check MATERIAL1_SUPPLIER

This boxplot tells us that NXG1911 has a lot of failures for rows with missing Material 1 supplier information (in mid-January).

Let us analyze MATERIAL2_SUPPLIER with respect to RESPONSE_FLAG.

This indicates that the increase in failures is due to the Tech2 material supplier. Since all three assembly lines failed with this Material 2 supplier, it is a major cause of failures.

Let us check MATERIAL2_SUPPLIER_FACILITY.

This indicates that there is a problem at the SE facility of the Material 2 supplier.

Let us analyze MATERIAL3_SUPPLIER

This indicates that the Tech1 Material 3 supplier has some issues: all of its lots show problems, and this is causing failures.

CONCLUSION

  1. There are unit-level as well as lot-level parameters that cause the failures. At the unit level, binary logistic regression and the analysis-of-deviance test give us the most important parameters for determining failure. After performing these tests, I concluded that Parameter 7 is the most important parameter: an increase in Parameter 7 gives a higher chance of failure. Parameters 8 and 13 are also important for determining failures.
  2. At the lot level, we have information about the material suppliers and assembly stations. Analyzing each assembly station gives interesting insights about the causes of failure.
  3. For the FAJ_TOOL_ID assembly station, FAJ211 is one of the tool IDs with a high increase in defects during mid-January.
  4. For the VD_TOOL_ID station, many TOOL_IDs show an increase in failures.
  5. For NX_TOOL_ID, two tools have a higher failure rate. It appears that all these assembly stations failed because of materials.
  6. For Material 1 suppliers there are many missing values, and the rows with missing values show an increase in defects. We would therefore need the missing data to draw conclusions about Material 1.
  7. The Tech2 Material 2 supplier and the SE material facility seem to have caused a large number of failures.
  8. All lots from the Tech1 Material 3 supplier show a high increase in defects during mid-January. These issues may have affected all the assembly stations.
  9. Finally, the bottom compartment on NX_TOOL_ID shows a huge number of failures, indicating a definite problem in that compartment.

QUALITY CONTROL METHODS

Detecting failure at an early stage is vital. Therefore, we need to gather data starting from the raw materials through to the quality of the final product and customer satisfaction, along with the failure points and the data corresponding to them. For instance, data about the raw materials can help us predict their quality. A machine nearing the end of its life is also more likely to cause failures at the assembly station. Moreover, we need information about the suppliers (as we have in this case study), which helps us detect whether data from certain suppliers is consistently faulty.

To keep a check on raw materials and suppliers, we can gather external ratings of each supplier. These give an idea of their market reputation, and such data is crucial in failure analysis.

From a quality control perspective, control charts play a very important role in detecting failures. They display the limits of statistical variability that can be explained as normal. If our parameter of interest (like Parameter 7 in this case study) stays within these limits, the process is said to be in control; otherwise the problem can be caught at that particular process step before it propagates.
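As a sketch, an individuals control chart for Parameter 7 with conventional 3-sigma limits could look like this (using the overall mean and SD as the baseline is my simplification; in practice the limits would be set from a known-good reference period):

```r
# Shewhart-style individuals chart for PARAMETER7 with 3-sigma control limits.
p7 = na.omit(unit.data$PARAMETER7)
center = mean(p7)
sigma  = sd(p7)
ucl = center + 3 * sigma   # upper control limit
lcl = center - 3 * sigma   # lower control limit

plot(p7, type = "p", pch = ".", xlab = "Unit index", ylab = "PARAMETER7")
abline(h = c(lcl, center, ucl), lty = c(2, 1, 2), col = c("red", "black", "red"))

# Units outside the limits are flagged for investigation.
out.of.control = which(p7 > ucl | p7 < lcl)
```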